Abstract

Much of the appeal of music lies in its power to convey emotions/moods and
to evoke them in listeners. In consequence, the past decade witnessed a
growing interest in modeling emotions from musical signals in the music
information retrieval (MIR) community. In this article, we present a novel
generative approach to music emotion modeling, with a specific focus on the
valence-arousal (VA) dimensional model of emotion. The presented generative
model, called acoustic emotion Gaussians (AEG), better accounts for the
subjectivity of emotion perception by the use of probability distributions.
Specifically, it learns from the emotion annotations of multiple subjects a
Gaussian mixture model in the VA space with prior constraints on the
corresponding acoustic features of the training music pieces. Such a
computational framework is technically sound, capable of learning in an
online fashion, and thus applicable to a variety of applications, including
user-independent (general) and user-dependent (personalized) emotion
recognition and emotion-based music retrieval. We report evaluations of the
aforementioned applications of AEG on a larger-scale emotion-annotated
corpus, AMG1608, to demonstrate the effectiveness of AEG and to showcase how
evaluations are conducted for research on emotion-based MIR. Directions of
future work are also discussed.
1 Introduction

Automatic music emotion recognition (MER) aims at modeling the association
between music and emotion so as to facilitate emotion-based music
organization, indexing, and retrieval. This technology has emerged in recent
years as a promising solution to deal with the huge amount of music
information available digitally [1,22,30,73]. It is generally believed that
music cannot be composed, performed, or listened to without affective
involvement [29]. The pursuit of emotional experience has also been
identified as one of the primary motivations and benefits of music listening
[27]. In addition to music retrieval, music emotion also finds applications
in context-aware music recommendation, playlist generation, music therapy,
and automatic music accompaniment for other media content, including image,
video, and text, amongst others [47,77].

Despite the significant progress that has been made in recent years, MER is
still considered a challenging problem because the perception of emotion in
music is usually highly subjective. A single, static ground-truth emotion
label is not sufficient to describe the possible emotions different people
perceive in the same piece of music [14,23]. Instead, it may be more
reasonable to learn a computational model from multiple responses of
different listeners [43] and to present probabilistic (soft) rather than
deterministic (hard) emotion assignments as the final result. In addition,
the subjective nature of emotion perception suggests the need for
personalization in systems for emotion-based music recommendation or
retrieval [74]. Early work on MER often chose to sidestep this critical issue
by either assuming that a common consensus can be achieved [22,66], or by
simply discarding music pieces for which a common consensus cannot be
achieved [35].


To help address this issue, we have proposed a novel generative model
referred to as acoustic emotion Gaussians (AEG) in our prior work [61-65].
The name of the AEG model comes from its use of multiple Gaussian
distributions to model the affective content of music. The algorithmic part
of AEG was first introduced in [63], along with a preliminary evaluation of
AEG for MER and emotion-based music retrieval. More details about the
analysis of the model learning of AEG can be found in a recent article [65].
Due to the parametric nature of AEG, model adaptation techniques have also
been proposed to personalize an AEG model in an online, incremental fashion,
rather than learning from scratch [6,64]. The goal of this article is to
position the AEG model as a theoretical framework and to provide detailed
information about the model itself and its application to personalized MER
and emotion-based music retrieval.

We conceptualize emotion by the valence-arousal (VA) model [45], which has
been used extensively by psychologists to study the relationship between
music and emotion [13,52]. These two dimensions are found to be the most
fundamental through factor analysis of self-reports of human affective
responses to music stimuli. Despite differences in nomenclature, existing
studies give similar interpretations of the resulting factors, most of which
correspond to valence (or pleasantness; positive/negative affective states)
and arousal (or activation; energy and stimulation level). For example,
happiness is an emotion associated with a positive valence and a high
arousal, while sadness is an emotion associated with a negative valence and a
low arousal. We refer to the 2-D space spanned by valence and arousal as the
VA space hereafter. Moreover, we are concerned with the emotion an individual
perceives as being expressed in a piece of music, rather than the emotion the
individual actually feels in response to the piece. This distinction is
necessary, as we do not necessarily feel sorrow when listening to a sad tune,
for example.

As the focus of this article is on dimensional emotion values such as
valence and arousal values, we refer interested readers to [1,20,53] for
studies and surveys on categorical MER research that views emotions as
discrete labels such as mood tags. We also note that approaches have been
proposed to model the relationship between discrete emotion labels and the
dimensional VA values [46,61], which is also beyond the scope of this
article.


The article is organized as follows. We first review related work in Section
2. Then, we present the mathematical derivation of AEG and the learning
algorithm in Section 3, followed by the personalization algorithm in Section
4. Sections 5 and 6 present applications of AEG to MER and emotion-based
music retrieval, respectively. Finally, we conclude the article.


2 Related Work on Dimensional Music Emotion Recognition

Early approaches to MER assumed that the perceived emotion of a music piece
can be represented as a single point in the VA space, in which the valence
and arousal values are considered as independent numerical values. The
ground-truth VA values of a music piece are obtained by averaging the
annotations of a number of human subjects, without considering the covariance
of the annotations. To predict the VA values of a music piece, a regression
model can be applied. Given $N$ inputs $(x_i, y_i)$, $i = 1, \ldots, N$,
where $x_i$ is a $D$-dimensional feature vector of the $i$-th input segment,
$D$ is the number of feature descriptors, and $y_i$ is the valence or arousal
value, a regression model is learned by algorithms such as support vector
regression (SVR) that minimize the mismatch (e.g. mean squared loss) between
the predicted and the ground-truth VA values.

As emotion perception rarely depends on a single musical factor but rather on
a combination of them [18,28], algorithms have used feature descriptors that
characterize the loudness, timbre, pitch, rhythm, melody, harmony, or lyrics
of music [19,40,50,53]. In particular, while it is usually easier to predict
arousal using, for example, loudness and timbre features, the prediction of
valence has been found more challenging [53,68,72]. Cross-cultural aspects of
emotion perception have also been studied [21]. To exploit the temporal
continuity of emotion variation within a piece of music, techniques such as
system identification [31], conditional random fields [24,49], hidden Markov
models [37], deep recurrent neural networks, or dynamic probabilistic models
[60] have also been proposed. Various approaches and features for MER have
been evaluated and compared using benchmarking datasets comprising over 1,000
Creative Commons licensed music pieces from the Free Music Archive, in the
2013 and 2014 MediaEval 'Emotion in Music' tasks [55,56].


Recent years have witnessed growing attempts to model the emotion of a music
piece as a probability distribution in the VA space to better account for the
subjective nature of emotion perception. For instance, Figure 1 shows the VA
values assigned by different annotators to four music pieces. To characterize
the distribution of the emotion annotations for each clip, a typical way is
to use a bivariate Gaussian distribution, where the mean vector represents
the most likely VA values and the covariance matrix indicates their
uncertainty. For a clip with highly subjective affective content, the
determinant of the covariance matrix would be larger.
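
The following is a minimal sketch of this idea, not the procedure of any cited work. It fits a bivariate Gaussian to the annotations of one clip and uses the determinant of the covariance matrix as a scalar indicator of subjectivity. The function names and the two annotation lists are hypothetical and chosen only for illustration; the covariance is the maximum-likelihood estimate (division by the number of annotations), which is an assumption.

```python
def fit_annotation_gaussian(annotations):
    """Fit a bivariate Gaussian to (valence, arousal) annotations of one clip.

    Returns the mean vector and the maximum-likelihood 2x2 covariance matrix.
    """
    n = len(annotations)
    mean_v = sum(v for v, _ in annotations) / n
    mean_a = sum(a for _, a in annotations) / n
    var_v = sum((v - mean_v) ** 2 for v, _ in annotations) / n
    var_a = sum((a - mean_a) ** 2 for _, a in annotations) / n
    cov_va = sum((v - mean_v) * (a - mean_a) for v, a in annotations) / n
    return (mean_v, mean_a), ((var_v, cov_va), (cov_va, var_a))


def determinant(cov):
    """Determinant of a 2x2 matrix, used here as a scalar measure of spread."""
    return cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]


# Hypothetical annotations: one clip listeners agree on, one they do not.
consistent = [(0.60, 0.50), (0.70, 0.60), (0.65, 0.55), (0.60, 0.60)]
divisive = [(0.80, 0.70), (-0.60, 0.20), (0.10, -0.50), (-0.30, 0.90)]

mean_c, cov_c = fit_annotation_gaussian(consistent)
mean_d, cov_d = fit_annotation_gaussian(divisive)
print(determinant(cov_c) < determinant(cov_d))
```

The clip with scattered annotations yields the larger determinant, which matches the interpretation of the covariance matrix as uncertainty.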

Existing approaches to predicting the emotion distribution of a music clip
from acoustic features fall into two categories. The heatmap approach [49,71]
quantizes each emotion dimension into $W$ equally spaced cells, leading to a
$W \times W$ grid representation of the VA space. The approach trains
regression models for predicting the emotion intensity of each cell. Higher
intensity at a cell indicates that people are more likely to perceive the
corresponding emotion from the clip. The emotion intensity over the VA space
creates a heatmap-like representation of the emotion distribution. However, a
heatmap is not a continuous representation of emotion, and emotion intensity
cannot be strictly considered as a probability estimate.
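
To make the grid representation concrete, the sketch below quantizes VA points into a $W \times W$ grid and builds a normalized annotation histogram of the kind a heatmap regressor could be trained to predict. It is an illustration under stated assumptions, not the method of [49,71]: the helper names are hypothetical, VA values are assumed to lie in [-1, 1], and rows index arousal while columns index valence.

```python
def va_to_cell(valence, arousal, w):
    """Map a VA point in [-1, 1]^2 to the (row, col) index of a W x W grid."""
    def index(x):
        i = int((x + 1.0) / 2.0 * w)
        return min(max(i, 0), w - 1)  # clamp so that x == 1.0 falls in the last cell
    return index(arousal), index(valence)


def annotation_heatmap(annotations, w):
    """Fraction of annotations that fall into each grid cell."""
    grid = [[0.0] * w for _ in range(w)]
    for valence, arousal in annotations:
        row, col = va_to_cell(valence, arousal, w)
        grid[row][col] += 1.0 / len(annotations)
    return grid


# Hypothetical annotations of one clip.
annotations = [(0.5, 0.5), (0.6, 0.4), (-1.0, 1.0), (1.0, -1.0)]
heat = annotation_heatmap(annotations, w=4)
```

The discretization is visible here: two nearby annotations can land in different cells, which is one reason the heatmap is not a continuous representation.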

The Gaussian-parameter approach [48,71], on the other hand, models the
emotion distribution of a clip as a bivariate Gaussian and trains multiple
regressors, one for each parameter of the mean vector and the covariance
matrix. This makes it easy to apply lessons learned from modeling the mean VA
values. In addition, performance analysis of this approach is easier; one can
analyze the importance of different acoustic features to each Gaussian
parameter individually. However, since the regression models are trained
independently, the correlation between valence and arousal is not exploited.
The estimation of the mean and of the covariance parameters is decoupled as
well.

Figure 1: Subjects' annotations of the perceived emotion of four 30-second
clips, which from left to right are Dancing Queen by ABBA, Civil War by Guns
N' Roses, Suzanne by Leonard Cohen, and All I Have To Do Is Dream by the
Everly Brothers. Each circle here corresponds to a subject's annotation, and
the overall emotion for a clip can be approximated by a 2-D Gaussian
distribution (the red cross and blue ellipse). Note that throughout this
article we use the contour of an ellipse to outline the standard deviation of
the corresponding Gaussian distribution.

A different methodology to address the subjectivity is to call for a
user-dependent model trained on annotations of a specific user to personalize
the emotion prediction [78-80]. In [78], two personalization methods are
proposed; the first trains a personalized MER system for each individual
specifically, whereas the second groups users according to some personal
factors (e.g. gender, music experience, and personality) and then trains a
group-wise MER system for each user group. Another two-stage personalization
scheme has also been studied: the first stage estimates the general
perception of a music piece, whereas the second one predicts the difference
between the general perception and the personal one of the target user.

We note that none of the aforementioned approaches renders a strict
probabilistic interpretation [65]. In addition, much existing work is built
on discriminative models such as multiple linear regression and SVR. Few
attempts have been made to develop a principled probabilistic framework that
is technically sound for modeling music emotion and that permits extending
the user-independent model to a user-dependent one, preferably in an online
fashion.

We also note that most existing work focuses on the annotation aspect of
music emotion research, namely MER. Little work has been devoted to the
retrieval aspect, that is, the development of emotion-based music retrieval
systems [73]. In what follows, we present the AEG model and its applications
to both of these aspects.

Figure 2: Illustration of the generative process of the AEG model.


3 Acoustic Emotion Gaussians: A Generative Approach for Music Emotion
Modeling

In our prior work [61-65], we proposed AEG, which is fundamentally different
from the existing regression or heatmap approaches. As Figure 2 shows, AEG
involves the generative process of VA emotion distributions from audio
signals. While the relationship between audio and music emotion may sometimes
be complicated and difficult to observe directly from an emotion-annotated
corpus, AEG uses a set of clip-level latent topics to resolve this issue.

We first define the terminology and explain the basic principle of AEG.
Suppose that there are $K$ audio descriptors $\{A_k\}_{k=1}^{K}$, each of
which is related to some acoustic feature vectors of music clips. Then, we
map the associated feature vectors of $A_k$ to a clip-level topic $z_k$. To
implement each $A_k$, we use a single Gaussian distribution in the acoustic
feature space. The aggregated Gaussians $\{A_k\}_{k=1}^{K}$ are called an
acoustic GMM (Gaussian mixture model). Subsequently, we map each $z_k$ to a
specific area in the VA space, which is modeled by a bivariate Gaussian
distribution $G_k$. We refer to the aggregated Gaussians $\{G_k\}_{k=1}^{K}$
as an affective GMM. Given a clip, its feature vectors are first used to
compute the posterior distribution over the topics, termed the topic
posterior representation $\theta$. In $\theta$, the posterior probability of
$z_k$ (denoted as $\theta_k$) is associated with $A_k$ and will then be used
to indicate the clip's importance to $G_k$. Consequently, the posterior
distribution $\theta = \{\theta_k\}_{k=1}^{K}$ can be incorporated into
learning the affective GMM as well as making emotion predictions for a clip.
AEG-based MER follows the flow depicted in Figure 2. Based on $\theta$ of a
test clip, we obtain the weighted affective GMM $\sum_{k} \theta_k G_k$,
which is able to generate various emotion distributions. In this sense, if a
clip's acoustic features can be completely described by the $h$-th topic
$z_h$, i.e. $\theta_h = 1$ and $\theta_k = 0, \forall k \neq h$, then its
emotion distribution would exactly follow $G_h$. As will be described in
Section 5, we can further approximate $\sum_{k} \theta_k G_k$ by a single,
representative affective Gaussian $\hat{G}$ for simplicity. This is
illustrated in the rightmost part of Figure 2.

Beyond valence and arousal, adding more dimensions (e.g. potency, or
dominant-submissive) might help resolve the ambiguity between affective
terms, such as anger and fear, which are close to one another in the second
quadrant of the VA space. Although AEG can be easily extended to describe
emotion in higher dimensions, we stay with the 2-D emotion model here, again
for simplicity.

3.1 Topic Posterior Representation 

The topic posterior representation of a music clip is generated from its
audio. We note that the temporal dynamics of audio signals are regarded as
essential for humans to perceive musical characteristics such as timbre,
rhythm, and tonality. To capture more local temporal variation of the
low-level features, we represent the acoustic features at a time instance at
the segment level, which corresponds to a sufficiently long duration (e.g.
0.4 second). A segment-level feature vector $x$ can be formed by, for
example, concatenating the mean and standard deviation of the frame-level
feature vectors within the segment. As a result, a clip is divided into
multiple overlapping segments, which are then represented by a sequence of
vectors $\{x_1, \ldots, x_T\}$, where $T$ is the length of the clip measured
in segments.
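
As an illustration of the segment-level representation, the sketch below slides an overlapping window over frame-level vectors and concatenates the per-dimension mean and standard deviation. It is only a sketch: the function name, the window length `seg_len`, and the hop size `hop` (both counted in frames) are hypothetical parameters, and the toy frames stand in for real frame-level features.

```python
import math


def segment_features(frames, seg_len, hop):
    """Turn frame-level vectors into overlapping segment-level vectors.

    Each output vector holds the per-dimension means followed by the
    per-dimension standard deviations of the frames in one segment.
    """
    segments = []
    for start in range(0, len(frames) - seg_len + 1, hop):
        window = frames[start:start + seg_len]
        dim = len(window[0])
        means = [sum(f[d] for f in window) / seg_len for d in range(dim)]
        stds = [
            math.sqrt(sum((f[d] - means[d]) ** 2 for f in window) / seg_len)
            for d in range(dim)
        ]
        segments.append(means + stds)
    return segments


# Ten toy 2-D frames; segments of 4 frames with a hop of 2 overlap by half.
frames = [[float(t), float(t % 2)] for t in range(10)]
segs = segment_features(frames, seg_len=4, hop=2)
```

With these settings the ten frames yield four segment-level vectors, each twice as long as a frame-level vector.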

To start the generative process of AEG, we first learn an acoustic GMM as the
bases to represent a clip. This acoustic GMM can be trained using the
expectation-maximization (EM) algorithm on a large set of segment-level
vectors $\mathcal{F}$ extracted from existing music clips. The learned
acoustic GMM defines the set of audio descriptors $\{A_k\}_{k=1}^{K}$ and can
be expressed as follows,

$$p(x) = \sum_{k=1}^{K} \pi_k A_k(x \mid m_k, S_k), \tag{1}$$

where $A_k(\cdot)$ is the $k$-th component Gaussian distribution, and
$\pi_k$, $m_k$, and $S_k$ are its corresponding prior weight, mean vector,
and covariance matrix, respectively. Note that we substitute equal weights
for the GMM (i.e. $\pi_k = 1/K, \forall k$), because the original $\pi_k$
learned from $\mathcal{F}$ does not imply the prior distribution of the
feature vectors in a clip. Such a heuristic usually results in better
performance, as pointed out in prior work.

Suppose that we have an emotion-annotated corpus $\mathcal{X}$ consisting of
$N$ music clips $\{s_i\}_{i=1}^{N}$. Given a clip
$s_i = \{x_{i,t}\}_{t=1}^{T_i}$, we then compute the segment-level posterior
probability for each feature vector in $s_i$ based on the acoustic GMM,

$$p(A_k \mid x_{i,t}) = \frac{A_k(x_{i,t} \mid m_k, S_k)}{\sum_{h=1}^{K} A_h(x_{i,t} \mid m_h, S_h)}. \tag{2}$$

Finally, the clip-level topic posterior probability $\theta_{i,k}$ of $s_i$
can be approximated by averaging the segment-level ones,

$$\theta_{i,k} \leftarrow p(z_k \mid s_i) \approx \frac{1}{T_i} \sum_{t=1}^{T_i} p(A_k \mid x_{i,t}). \tag{3}$$

This approximation assumes that each segment of $s_i$ contributes equally to
$\theta_{i,k}$, which is thereby capable of representing the clip's acoustic
features. We use a vector $\theta_i \in \mathbb{R}^{K}$, whose $k$-th
component is $\theta_{i,k}$, as the topic posterior of $s_i$.
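
Eqs. 2 and 3 can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes diagonal covariance matrices for brevity (the text does not restrict the covariance structure), uses equal mixture weights as stated above, and works in the log domain to avoid numerical underflow. All function and variable names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (an assumption)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def segment_posterior(x, means, variances):
    """Eq. 2 with equal mixture weights: normalized component likelihoods."""
    logs = [log_gauss_diag(x, m, v) for m, v in zip(means, variances)]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]  # subtract max for stability
    total = sum(weights)
    return [w / total for w in weights]


def topic_posterior(segments, means, variances):
    """Eq. 3: average the segment-level posteriors over the clip."""
    theta = [0.0] * len(means)
    for x in segments:
        for k, p in enumerate(segment_posterior(x, means, variances)):
            theta[k] += p / len(segments)
    return theta


# Toy acoustic GMM with K = 2 well-separated components in a 2-D feature space.
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
clip = [[0.1, -0.2], [0.3, 0.1], [4.9, 5.2], [0.0, 0.2]]
theta = topic_posterior(clip, means, variances)
```

In this toy example three of the four segments lie near the first component, so the first entry of `theta` is close to 0.75 and the entries sum to one.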


3.2 Prior Model for Emotion Annotation 

To account for the subjectivity of emotional responses to a music clip, we
ask multiple subjects to annotate the clip. However, as some subjects'
annotations may not be reliable, we introduce a user prior model to quantify
the contribution of each subject.

Let $e_{i,j} \in \mathbb{R}^{2}$ (a vector including the valence and arousal
values) denote one of the annotations of $s_i$ given by the $j$-th subject,
and let $U_i$ denote the number of subjects who have annotated $s_i$. Note
that the subject index is local to each clip: $e_{i,j}$ and $e_{q,j}$, where
$q \neq i$, may not correspond to the same subject. Then, we build the user
prior model $\gamma$ to describe the confidence of $e_{i,j}$ in $s_i$ using a
single Gaussian distribution,

$$\gamma(e_{i,j} \mid s_i) = G(e_{i,j} \mid a_i, B_i), \tag{4}$$

where $a_i = \frac{1}{U_i} \sum_{j=1}^{U_i} e_{i,j}$,
$B_i = \frac{1}{U_i} \sum_{j=1}^{U_i} (e_{i,j} - a_i)(e_{i,j} - a_i)^{T}$,
and $G(e \mid a_i, B_i)$ is called the annotation Gaussian of $s_i$. One can
observe what $a_i$ and $B_i$ look like from the four example clips in Figure
1. Empirical results show that a single Gaussian performs better than a GMM
for setting up $\gamma(\cdot)$.

The confidence of $e_{i,j}$ can be estimated based on the likelihood
calculated by Eq. 4. If an annotation is far away from the mean, it gives a
small likelihood accordingly. In addition to Gaussian distributions, any
criterion that is able to reflect the importance of a user's annotation of a
clip can be applied to $\gamma$.




The probability of $e_{i,j}$, referred to as the clip-level annotation prior,
can be calculated by normalizing the likelihood of $e_{i,j}$ over the
cumulative likelihood of all annotations in $s_i$,

$$p(e_{i,j} \mid s_i) = \frac{\gamma(e_{i,j} \mid s_i)}{\sum_{r=1}^{U_i} \gamma(e_{i,r} \mid s_i)}. \tag{5}$$

Based on the clip-level annotation prior, we further define the corpus-level
clip prior to describe the importance of each clip,

$$p(s_i \mid \mathcal{X}) = \frac{\sum_{j=1}^{U_i} \gamma(e_{i,j} \mid s_i)}{\sum_{q=1}^{N} \sum_{r=1}^{U_q} \gamma(e_{q,r} \mid s_q)}. \tag{6}$$

From these definitions, we can make two observations. First, if a clip's
annotations are consistent (i.e. $B_i$ is small), it is considered less
subjective. Second, if a clip is annotated by more subjects, the
corresponding $\gamma$ model should be more reliable. As a result, we can
define the corpus-level annotation prior $\gamma_{i,j}$ for each $e_{i,j}$ in
the corpus $\mathcal{X}$ by multiplying Eqs. 5 and 6,

$$\gamma_{i,j} \leftarrow p(e_{i,j} \mid s_i)\, p(s_i \mid \mathcal{X}) = \frac{\gamma(e_{i,j} \mid s_i)}{\sum_{q=1}^{N} \sum_{r=1}^{U_q} \gamma(e_{q,r} \mid s_q)}, \tag{7}$$

which is computed beforehand and fixed in learning the affective GMM.
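
The annotation prior of Eqs. 4 to 7 can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the two-clip toy corpus are hypothetical, and each clip is assumed to have enough non-collinear annotations for its covariance matrix to be invertible.

```python
import math


def gauss2(e, mean, cov):
    """Density of a bivariate Gaussian with full 2x2 covariance at point e."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = e[0] - mean[0], e[1] - mean[1]
    # Quadratic form with the explicit inverse of a 2x2 matrix.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def annotation_gaussian(annos):
    """Eq. 4: mean a_i and covariance B_i of one clip's annotations."""
    u = len(annos)
    mv = sum(e[0] for e in annos) / u
    ma = sum(e[1] for e in annos) / u
    bvv = sum((e[0] - mv) ** 2 for e in annos) / u
    baa = sum((e[1] - ma) ** 2 for e in annos) / u
    bva = sum((e[0] - mv) * (e[1] - ma) for e in annos) / u
    return (mv, ma), ((bvv, bva), (bva, baa))


def corpus_annotation_prior(corpus):
    """Eq. 7: likelihood of each annotation under its clip's annotation
    Gaussian, normalized by the total likelihood over the whole corpus."""
    likes = []
    for annos in corpus:
        mean, cov = annotation_gaussian(annos)
        likes.append([gauss2(e, mean, cov) for e in annos])
    total = sum(sum(row) for row in likes)
    return [[value / total for value in row] for row in likes]


# Hypothetical corpus: a clip with consistent annotations and a divisive one.
corpus = [
    [(0.60, 0.50), (0.70, 0.60), (0.65, 0.50), (0.60, 0.65)],
    [(0.80, 0.70), (-0.60, 0.20), (0.10, -0.50), (-0.30, 0.90)],
]
gamma = corpus_annotation_prior(corpus)
```

In this toy corpus the annotations of the consistent clip receive most of the prior mass, in line with the first observation above.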


3.3 Learning the Affective GMM 

Given a training music clip $s_i$ in the corpus $\mathcal{X}$, we assume the
emotional responses can be generated from an affective GMM weighted by its
topic posterior $\theta_i$,

$$p(e_{i,j} \mid \theta_i, \Lambda) = \sum_{k=1}^{K} \theta_{i,k}\, G_k(e_{i,j} \mid \mu_k, \Sigma_k), \tag{8}$$

where $G_k(\cdot)$ is the $k$-th affective Gaussian with mean $\mu_k$ and
covariance $\Sigma_k$ to be learned. Here $\theta_{i,k}$ stands for the fixed
weight associated with $A_k$ to carry the audio characteristics of $s_i$. We
therefore call $\theta_i$ an acoustic prior.
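Then, the objective function is in the form of the marginal likelihood
function of the annotations:

$$\begin{aligned}
p(E \mid \mathcal{X}, \Lambda)
&= \sum_{i=1}^{N} p(s_i \mid \mathcal{X}) \sum_{j=1}^{U_i} p(e_{i,j} \mid s_i)\, p(e_{i,j} \mid \theta_i, \Lambda) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{U_i} p(s_i \mid \mathcal{X})\, p(e_{i,j} \mid s_i)\, p(e_{i,j} \mid \theta_i, \Lambda) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{U_i} p(s_i \mid \mathcal{X})\, p(e_{i,j} \mid s_i) \sum_{k=1}^{K} \theta_{i,k}\, G_k(e_{i,j} \mid \mu_k, \Sigma_k),
\end{aligned} \tag{9}$$

where $E = \{e_{i,j}\}$ with $i = 1, \ldots, N$ and $j = 1, \ldots, U_i$,
$\mathcal{X} = \{s_i, \theta_i\}_{i=1}^{N}$, and
$\Lambda = \{\mu_k, \Sigma_k\}_{k=1}^{K}$ is the parameter set of the
affective GMM. Taking the logarithm of Eq. 9 and replacing
$p(s_i \mid \mathcal{X})\, p(e_{i,j} \mid s_i)$ by $\gamma_{i,j}$ (cf. Eq. 7)
leads to

$$L = \log \sum_{i} \sum_{j} \gamma_{i,j} \sum_{k} \theta_{i,k}\, G_k(e_{i,j} \mid \mu_k, \Sigma_k), \tag{10}$$

where $\sum_{i} \sum_{j} \gamma_{i,j} = 1$. To learn the affective GMM, we
can maximize the log-likelihood in Eq. 10 with respect to the Gaussian
parameters. We first derive a lower bound of $L$ according to Jensen's
inequality,

$$L \geq L_{\text{bound}} = \sum_{i} \sum_{j} \gamma_{i,j} \log \sum_{k} \theta_{i,k}\, G_k(e_{i,j} \mid \mu_k, \Sigma_k). \tag{11}$$

Then, we treat $L_{\text{bound}}$ as a surrogate of $L$ and use the EM
algorithm to estimate the parameters of the affective GMM. In the E-step, we
derive the expectation over the posterior distribution of $z_k$ for all the
training annotations,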


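$$Q = \sum_{i} \sum_{j} \gamma_{i,j} \sum_{k} p(z_k \mid e_{i,j}) \Big( \log \theta_{i,k} + \log G_k(e_{i,j} \mid \mu_k, \Sigma_k) \Big), \tag{12}$$

where

$$p(z_k \mid e_{i,j}) = \frac{\theta_{i,k}\, G_k(e_{i,j} \mid \mu_k, \Sigma_k)}{\sum_{h=1}^{K} \theta_{i,h}\, G_h(e_{i,j} \mid \mu_h, \Sigma_h)}. \tag{13}$$

In the M-step, we first set the derivative of Eq. 12 with respect to $\mu_k$
to zero and obtain the update rule for the mean vector,

$$\mu_k' \leftarrow \frac{\sum_{i} \sum_{j} \gamma_{i,j}\, p(z_k \mid e_{i,j})\, e_{i,j}}{\sum_{i} \sum_{j} \gamma_{i,j}\, p(z_k \mid e_{i,j})}. \tag{14}$$

Following a similar line of reasoning, we obtain the update rule for
$\Sigma_k$:

$$\Sigma_k' \leftarrow \frac{\sum_{i} \sum_{j} \gamma_{i,j}\, p(z_k \mid e_{i,j})\, (e_{i,j} - \mu_k')(e_{i,j} - \mu_k')^{T}}{\sum_{i} \sum_{j} \gamma_{i,j}\, p(z_k \mid e_{i,j})}. \tag{15}$$

Theoretically, the EM algorithm iteratively maximizes the $L_{\text{bound}}$
value in Eq. 11 until convergence. One can fix the maximal number of
iterations or set a stopping criterion on the relative increase of
$L_{\text{bound}}$.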
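
One EM iteration for the affective GMM (Eqs. 11 to 15) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: annotations from all clips are flattened into one list, each paired with the topic posterior of its clip and its annotation prior; the function names, the toy data, and the choice of five iterations are hypothetical. No safeguard against degenerate covariance matrices is included.

```python
import math


def gauss2(e, mean, cov):
    """Density of a bivariate Gaussian with full 2x2 covariance at point e."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = e[0] - mean[0], e[1] - mean[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def em_step(annos, thetas, gammas, means, covs):
    """One EM iteration; also returns the bound (Eq. 11) of the old parameters."""
    k_num = len(means)
    weighted_post, bound = [], 0.0
    # E-step: annotation prior times topic posterior of each annotation (Eq. 13).
    for e, theta, g in zip(annos, thetas, gammas):
        joint = [theta[k] * gauss2(e, means[k], covs[k]) for k in range(k_num)]
        total = sum(joint)
        bound += g * math.log(total)
        weighted_post.append([g * j / total for j in joint])
    # M-step: weighted mean (Eq. 14) and weighted covariance (Eq. 15).
    new_means, new_covs = [], []
    for k in range(k_num):
        w = [wp[k] for wp in weighted_post]
        norm = sum(w)
        mu = [sum(wi * e[d] for wi, e in zip(w, annos)) / norm for d in range(2)]
        cov = [
            [sum(wi * (e[r] - mu[r]) * (e[c] - mu[c]) for wi, e in zip(w, annos)) / norm
             for c in range(2)]
            for r in range(2)
        ]
        new_means.append(tuple(mu))
        new_covs.append((tuple(cov[0]), tuple(cov[1])))
    return new_means, new_covs, bound


# Toy data: two clips, each dominated by a different latent topic.
clip_a = [(0.60, 0.50), (0.70, 0.60), (0.50, 0.70), (0.65, 0.40)]
clip_b = [(-0.50, -0.40), (-0.60, -0.30), (-0.40, -0.60), (-0.55, -0.20)]
annos = clip_a + clip_b
thetas = [[0.9, 0.1]] * 4 + [[0.1, 0.9]] * 4
gammas = [1.0 / len(annos)] * len(annos)  # uniform annotation prior

means = [(0.2, 0.2), (-0.2, -0.2)]
covs = [((0.25, 0.0), (0.0, 0.25))] * 2
bounds = []
for _ in range(5):  # a small, fixed number of iterations
    means, covs, bound = em_step(annos, thetas, gammas, means, covs)
    bounds.append(bound)
```

The recorded values in `bounds` should not decrease from one iteration to the next, which gives a simple check when experimenting with a stopping criterion.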

Note that we can ignore the annotation prior by setting a uniform
distribution, i.e., $\gamma_{i,j} = 1, \forall i, j$. This case is called
"AEG Uniform" in the experiments. In contrast, the case with a non-uniform
annotation prior is called "AEG AnnoPrior."


3.4 Discussion 


As Eqs. 14 and 15 show, the re-estimated parameters $\mu_k'$ and $\Sigma_k'$
are collectively contributed by $e_{i,j}, \forall i, j$, with the weights
governed by the product of $\gamma_{i,j}$ and $p(z_k \mid e_{i,j})$.
Consequently, the learning process seamlessly takes the annotation prior,
acoustic prior, and annotation clusters over the current affective GMM into
consideration. In such a way, the annotations of different clips can be
shared with one another according to their corresponding prior
probabilities. This can be a key factor that enables AEG to generalize the
audio-to-emotion mapping.

As the affective GMM is fitted to the data, a small number of affective
Gaussian components might overfit to some emotion annotations, leading to
the so-called singularity problem. When this occurs, the corresponding
covariance matrices become non-positive definite (non-PD). For instance,
when a component affective Gaussian is contributed to by only one or two
annotations, the corresponding covariance shape becomes a point or a
straight line in the VA space. To tackle this issue, we can remove a
component Gaussian when it happens to produce a non-PD covariance matrix
during the EM iterations [65].

We note that "early stop" is a very important heuristic while learning the
affective GMM. We find that setting a small maximal number of iterations
(e.g. 7 to 11) or a larger stopping threshold for the relative increase of
$L_{\text{bound}}$ (e.g. 0.01) empirically leads to better generalizability.
It can not only prevent the aforementioned singularity problem but also
avoid overfitting to the training data. Empirical results show that the
accuracy of MER improves as the iterations proceed and then degrades once
the optimal iteration number has been passed [65]. Moreover, AEG AnnoPrior
empirically converges faster and learns smaller covariances than AEG Uniform
does.
4 Personalization with AEG


The capability for personalization is a very important characteristic that
completes the AEG framework, making it more applicable to real-world
applications. As AEG is a probabilistic, parametric model, it can
incorporate personal information of a particular user via model adaptation
techniques to make custom predictions. While such personal information may
include personal emotion annotations, user profile, transaction records,
listening history, and relevance feedback, we focus on the use of personal
emotion annotations in this article.

Because of the cognitive load of annotating music emotion, it is usually not
easy to collect a sufficient amount of personal annotations at once to make
the system reach an acceptable performance level. Instead, a user may
provide annotations sporadically in different listening sessions. An online
learning strategy is therefore desirable. When the annotations of a target
user are scarce, a good online learning method needs to prevent overfitting
to the personal data in order to retain a certain degree of model
generalizability. In other words, we cannot totally ignore the contributions
of emotion perceptions from other users. Motivated by the Gaussian Mixture
Model-Universal Background Model (GMM-UBM) speaker verification system [44],
we first treat the affective GMM learned from a broad set of subjects
(called background users) as a background (general) model, and then employ a
maximum a posteriori (MAP)-based method to update the parameters of the
background model using the personal annotations in an online manner.
Theoretically, the resulting personalized model will find a good trade-off
between the target user's annotations and the background model.

4.1 Model Adaptation 

In what follows, the acoustic GMM stays fixed throughout the personalization
process, since it is used as a reference model to represent the music audio.
In contrast, the affective GMM is assumed to be learned on plenty of emotion
annotations from many subjects, so it possesses a sufficient representation
(well-trained parameters) for user-independent (i.e. general) emotion
perceptions. Our goal is to learn the personal perception with respect to
the affective GMM $\Lambda$ accordingly.

Suppose that we have a target user $u^{\star}$ who has annotated $M$ music
clips, denoted as $\mathcal{X}^{\star} = \{e_i, \theta_i\}_{i=1}^{M}$, where
$e_i$ and $\theta_i$ are the emotion annotation and the topic posterior of a
clip, respectively. We first compute each posterior probability over the
latent topics based on the background affective GMM,

$$p(z_k \mid e_i, \theta_i) = \frac{\theta_{i,k}\, G_k(e_i \mid \mu_k, \Sigma_k)}{\sum_{h=1}^{K} \theta_{i,h}\, G_h(e_i \mid \mu_h, \Sigma_h)}. \tag{16}$$

Then, we derive the expected sufficient statistics on $\mathcal{X}^{\star}$
over the posterior distribution $p(z_k \mid e_i, \theta_i)$ for the mixture
weight, mean, and covariance parameters:

$$\Gamma_k = \sum_{i=1}^{M} p(z_k \mid e_i, \theta_i), \tag{17}$$

$$E(\mu_k) = \frac{1}{\Gamma_k} \sum_{i=1}^{M} p(z_k \mid e_i, \theta_i)\, e_i, \tag{18}$$

$$E(\Sigma_k) = \frac{1}{\Gamma_k} \sum_{i=1}^{M} p(z_k \mid e_i, \theta_i)\, (e_i - E(\mu_k))(e_i - E(\mu_k))^{T}. \tag{19}$$

Finally, the new parameters of the personalized affective GMM can be
obtained according to the MAP criterion [15]. The resulting update rules
take the form of interpolations between the expected sufficient statistics
(i.e. $E(\mu_k)$ and $E(\Sigma_k)$) and the parameters of the background
model (i.e. $\mu_k$ and $\Sigma_k$) as follows:

$$\mu_k' \leftarrow \alpha_k^{m} E(\mu_k) + (1 - \alpha_k^{m})\, \mu_k, \tag{20}$$

$$\Sigma_k' \leftarrow \alpha_k^{v} E(\Sigma_k) + (1 - \alpha_k^{v}) \left( \Sigma_k + \mu_k \mu_k^{T} \right) - \mu_k' (\mu_k')^{T}. \tag{21}$$

The coefficients $\alpha_k^{m}$ and $\alpha_k^{v}$ are data-dependent and
are defined as

$$\alpha_k^{m} = \frac{\Gamma_k}{\Gamma_k + \beta^{m}}, \qquad \alpha_k^{v} = \frac{\Gamma_k}{\Gamma_k + \beta^{v}}, \tag{22}$$

where $\beta^{m}$ and $\beta^{v}$ are related to the hyperparameters [15]
and thus should be empirically set by users. Note that there is no need to
update the mixture weights, as they are already occupied by the fixed topic
posterior weights.
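
The mean update of the adaptation procedure (Eqs. 16 to 18, 20, and 22) can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the covariance update of Eq. 21 is deliberately omitted for brevity, and the function names, the background model, the user annotations, and the value `beta_m=3.0` are all hypothetical.

```python
import math


def gauss2(e, mean, cov):
    """Density of a bivariate Gaussian with full 2x2 covariance at point e."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = e[0] - mean[0], e[1] - mean[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def map_adapt_means(annos, thetas, means, covs, beta_m):
    """Adapt the affective means to one user's annotations (means only)."""
    k_num = len(means)
    posts = []
    for e, theta in zip(annos, thetas):
        joint = [theta[k] * gauss2(e, means[k], covs[k]) for k in range(k_num)]
        total = sum(joint)
        posts.append([j / total for j in joint])            # Eq. 16
    adapted = []
    for k in range(k_num):
        big_gamma = sum(p[k] for p in posts)                # Eq. 17
        if big_gamma == 0.0:
            adapted.append(means[k])  # no evidence: keep the background mean
            continue
        expected = [
            sum(p[k] * e[d] for p, e in zip(posts, annos)) / big_gamma
            for d in range(2)
        ]                                                   # Eq. 18
        alpha = big_gamma / (big_gamma + beta_m)            # Eq. 22
        adapted.append(tuple(
            alpha * expected[d] + (1.0 - alpha) * means[k][d] for d in range(2)
        ))                                                  # Eq. 20
    return adapted


# Hypothetical background model and three personal annotations.
means = [(0.5, 0.5), (-0.5, -0.5)]
covs = [((0.05, 0.0), (0.0, 0.05))] * 2
user_annos = [(0.1, 0.6), (0.0, 0.7), (0.2, 0.5)]
user_thetas = [[0.9, 0.1]] * 3
personalized = map_adapt_means(user_annos, user_thetas, means, covs, beta_m=3.0)
```

In this toy case the first mean moves roughly halfway toward the user's annotations, while the second mean, which receives almost no posterior mass, stays essentially at its background value.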


4.2 Discussion 

The MAP-based method is preferable in that we can determine the
interpolation factor that balances the contribution between the personal
annotations and the background model without loss of model generalizability,
as demonstrated by its superior effectiveness and efficiency in speaker
adaptation tasks. If a personal annotation $\{e_m, \theta_m\}$ is highly
correlated to a latent topic $z_k$ (i.e. $p(z_k \mid e_m, \theta_m)$ is
large), the annotation will contribute more to the update of
$\{\mu_k', \Sigma_k'\}$. In contrast, if the user's annotations have nothing
to do with $z_h$ (i.e. the cumulative posterior probability $\Gamma_h = 0$),
the parameters $\{\mu_h', \Sigma_h'\}$ would remain the same as those of the
background model, as shown by the fact that $\alpha_h^{m}$ and
$\alpha_h^{v}$ would be 0.

Another advantage of the MAP-based method is that users are free to provide
personal annotations for whatever songs they like, such as the songs they
are more familiar with. This can help reduce the cognitive load of the
personalization process. As the AEG framework is audio-based, the annotated
clips can be arbitrary and do not have to be those included in the corpus
for training the background model.

Finally, we note that the model adaptation procedure only needs to be
performed once, so the algorithm is fairly efficient. It only requires
computing the expected sufficient statistics and updating the parameters $K$
times. In consequence, we can keep refining the background model whenever a
small number of personal annotations are available, and readily use the
updated model for personalized MER or music retrieval. Model adaptation for
GMMs is not limited to the MAP method; we refer interested readers to the
literature for more advanced methods.


5 AEG-based Music Emotion Recognition 

5.1 Algorithm 

As described in Section 3, we predict the emotion distribution of an unseen
clip by weighting the affective GMM using the clip's topic posterior
$\theta = \{\theta_k\}_{k=1}^{K}$ as

$$p(e \mid \theta) = \sum_{k=1}^{K} \theta_k\, G_k(e \mid \mu_k, \Sigma_k). \tag{23}$$

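In addition, we can also use a single, representative affective Gaussian
$\hat{G}(\hat{\mu}, \hat{\Sigma})$ to summarize the weighted affective GMM.
This can be done by solving the following optimization problem:

$$\min_{\hat{\mu}, \hat{\Sigma}} \sum_{k=1}^{K} \theta_k\, D_{\mathrm{KL}}\big( G_k(\mu_k, \Sigma_k) \,\|\, \hat{G}(\hat{\mu}, \hat{\Sigma}) \big), \tag{24}$$

where

$$D_{\mathrm{KL}}(G_A \,\|\, G_B) = \frac{1}{2} \Big( \mathrm{tr}(\Sigma_A \Sigma_B^{-1}) - \log \left| \Sigma_A \Sigma_B^{-1} \right| + (\mu_A - \mu_B)^{T} \Sigma_B^{-1} (\mu_A - \mu_B) - 2 \Big) \tag{25}$$

denotes the one-way (asymmetric) Kullback-Leibler (KL) divergence (a.k.a.
relative entropy) from $G_A(\mu_A, \Sigma_A)$ to $G_B(\mu_B, \Sigma_B)$.
This optimization problem is strictly convex in $\hat{\mu}$ and
$\hat{\Sigma}$, which means that there is a unique minimizer for each of the
two variables [11]. Setting the partial derivative with respect to
$\hat{\mu}$ to 0, we have

$$\sum_{k} \theta_k (2\hat{\mu} - 2\mu_k) = 0. \tag{26}$$

Given the fact that $\sum_{k} \theta_k = 1$, we derive

$$\hat{\mu} = \sum_{k=1}^{K} \theta_k \mu_k. \tag{27}$$

Setting the partial derivative with respect to $\hat{\Sigma}^{-1}$ to 0,

$$\sum_{k} \theta_k \Big( \Sigma_k - \hat{\Sigma} + (\mu_k - \hat{\mu})(\mu_k - \hat{\mu})^{T} \Big) = 0, \tag{28}$$

we obtain the optimal covariance matrix as

$$\hat{\Sigma} = \sum_{k=1}^{K} \theta_k \Big( \Sigma_k + (\mu_k - \hat{\mu})(\mu_k - \hat{\mu})^{T} \Big). \tag{29}$$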
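
The closed-form solution in Eqs. 27 and 29 amounts to matching the first two moments of the weighted affective GMM. The sketch below illustrates it on a toy two-component mixture; the function name and the example parameters are hypothetical.

```python
def summarize_gmm(theta, means, covs):
    """Collapse a weighted 2-D GMM into one Gaussian (Eqs. 27 and 29)."""
    # Eq. 27: the representative mean is the weighted average of the means.
    mu = [sum(t * m[d] for t, m in zip(theta, means)) for d in range(2)]
    # Eq. 29: within-component covariance plus the spread of the component means.
    cov = [
        [sum(t * (c[r][s] + (m[r] - mu[r]) * (m[s] - mu[s]))
             for t, m, c in zip(theta, means, covs))
         for s in range(2)]
        for r in range(2)
    ]
    return mu, cov


# Two equally weighted components placed symmetrically on the valence axis.
theta = [0.5, 0.5]
means = [(1.0, 0.0), (-1.0, 0.0)]
covs = [((0.1, 0.0), (0.0, 0.1)), ((0.1, 0.0), (0.0, 0.1))]
mu_hat, cov_hat = summarize_gmm(theta, means, covs)
```

Here the representative Gaussian is centered at the origin, and its valence variance is inflated by the distance between the two component means, which shows how a bi-modal mixture is flattened into a single broad Gaussian.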


5.2 Discussion 

Representing the predicted result as a single Gaussian is functionally
necessary, because it is easier and more straightforward to interpret or
visualize the emotion prediction for users with only a single mean (center)
and covariance (uncertainty). However, this may run counter to the
theoretical arguments given in favor of a GMM that permits emotion modeling
at a finer granularity. For instance, it is inadequate for excerpts whose
emotional responses are by nature bi-modal. We note that in applications
such as emotion-based music retrieval and music video generation, one can
directly use the raw weighted GMM (i.e. Eq. 23) as the emotion index of a
song in response to queries given in the VA space. We will detail this
aspect later in the section on emotion-based music retrieval.

The computation of the representative Gaussian is quite efficient. The
complexity depends mainly on K and the number of frames T of a clip:
computing $\theta_k$ requires KT operations, whereas computing $\hat{\mu}$ and $\hat{\Sigma}$
(Eqs. 27 and 29) requires K vector multiplications and K matrix operations,
respectively. This efficiency is important for dealing with a large-scale
music database and for applications such as real-time music emotion
tracking on a mobile device [59, 60].

5.3 Evaluation on General MER 

5.3.1 Dataset 

We use the AMG1608 dataset for evaluating both general and personalized
MER. The dataset contains 1,608 30-second music clips annotated by 665
subjects (345 are male; average age is 32.0±11.4) recruited mostly from the
crowdsourcing platform Mechanical Turk. The subjects were asked to rate,
via the internet, the VA values that best describe their general (instead
of moment-to-moment) emotion perception of each clip. The VA values, which
are real values in the range [-1, 1], are entered by clicking on the
emotion space on a square interface panel. The subjects were instructed to
rate the perceived rather than felt emotion. Each music clip was annotated by
15-32 subjects. Each subject annotated 12-924 clips, and 46 out of the 665
subjects annotated more than 150 music clips, making the dataset a useful
corpus for research on MER personalization. The average Krippendorff’s $\alpha$
across the music clips is 0.31 for valence and 0.46 for arousal, which are both
in the range of fair agreement. Please refer to the original description of
AMG1608 for more details about this dataset.


5.3.2 Acoustic Features 


As different emotion perceptions are usually associated with different
patterns of features [17], we use two toolboxes, MIRtoolbox [33] and YAAFE
[39], to extract four sets of frame-based features from audio signals,
including MFCC-related features, tonal features, spectral features, and
temporal features, as listed in Table 1. We down-sample all the audio clips
in AMG1608 to 22,050 Hz and normalize them to the same volume level. All the
frame-based features are extracted with the same frame size of 50 ms and
50% hop size. Each dimension in the frame-based feature vectors is
normalized to zero mean and unit standard deviation. We concatenate all the
four sets of features for each frame, as this leads to better performance
in acoustic modeling in our pilot study. As a result, a frame-level feature
vector contains 72 dimensions of features.

Table 1: Frame-based acoustic features used in the evaluation.

| Feature  | Dim. | Description |
|----------|------|-------------|
| MFCCs    | 40   | 20 Mel-frequency cepstral coefficients and the first-order time differences [12]. |
| Tonal    | 17   | Octave band signal intensity using a triangular octave filter bank and the ratio of these intensity values [39]. |
| Spectral | 11   | Linear predictor coefficients that capture the spectral envelope of the audio signal, spectral flux, and spectral shape descriptors. |
| Temporal | 4    | Shape and statistics (centroid, spread, skewness, and kurtosis). |
| All      | 72   | Concatenation of all the four types of features mentioned above. |

However, it does not make sense to analyze and predict the music emotion
on a specific frame. Instead of the bag-of-frames approach [57, 58], we
adopt the bag-of-segments approach for the topic posterior representation,
because a segment is able to capture more local temporal variation of the
low-level features. Our preliminary result has also confirmed this
hypothesis. To generate a segment-level feature vector representing a basic
term in the bag-of-segments approach, we concatenate the mean and standard
deviation of 16 consecutive frame-level feature vectors, leading to a
144-dimensional vector for a segment. The hop size for a segment is 4
frames. Given the acoustic GMM, we then follow the procedure described in
Section 3.1 to compute the topic posterior vector of a music clip.
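The segment construction described above (16 frames per segment, hop of 4
frames, mean and standard deviation concatenated) can be sketched as
follows. The function name `segment_features`, the use of the population
standard deviation, and the choice to drop trailing frames that do not fill
a whole segment are assumptions of this sketch, since the text does not
specify them.

```python
import numpy as np

def segment_features(frames, seg_len=16, hop=4):
    """Turn frame-level features (T, D) into segment-level features (S, 2*D).

    Each segment concatenates the per-dimension mean and standard deviation
    of seg_len consecutive frames; consecutive segments start hop frames apart.
    """
    frames = np.asarray(frames, dtype=float)
    n_frames, dim = frames.shape
    if n_frames < seg_len:
        return np.empty((0, 2 * dim))    # clip too short for a single segment
    segments = []
    for start in range(0, n_frames - seg_len + 1, hop):
        window = frames[start:start + seg_len]
        segments.append(np.concatenate([window.mean(axis=0), window.std(axis=0)]))
    return np.vstack(segments)

# Hypothetical clip: 100 frames of 72-dimensional features.
rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 72))
segments = segment_features(frames)
print(segments.shape)  # (22, 144)
```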


5.3.3 Evaluation Metrics 

The accuracy of general MER is evaluated using three performance metrics:
two-way KL divergence (KL2), Euclidean distance, and $R^2$ (also known as the
coefficient of determination). The first two measure the distance between
the prediction and the ground truth. The lower the value is, the better the
performance. KL2 considers the performance with respect to the bivariate
Gaussian distribution of a clip, while the Euclidean distance is concerned
with the VA mean only. $R^2$ is also concerned with the VA mean only. In
contrast to the distance measures, a high $R^2$ value is preferred. Moreover,
$R^2$ is computed separately for valence and arousal.

Specifically, we are given the distribution of the ground truth annotations
$M_i = G(a_i, B_i)$ (cf. Section 3.2) and the predicted distribution of each
test clip $\hat{M}_i = G(\hat{\mu}_i, \hat{\Sigma}_i)$, both of which are modeled as a bivariate
Gaussian distribution, where $i \in \{1, \ldots, N\}$ denotes the index of a clip in
the test set. Instead of the one-way KL divergence (cf. Eq. 25) used for
determining the representative Gaussian, we evaluate the performance of
emotion distribution prediction based on the KL2 divergence defined by

$$D_{\mathrm{KL2}}(G_A, G_B) = \frac{1}{2} D_{\mathrm{KL}}(G_A \,\|\, G_B) + \frac{1}{2} D_{\mathrm{KL}}(G_B \,\|\, G_A). \tag{30}$$


The average KL2 divergence (AKL), which measures the symmetric distance
between the predicted emotion distribution and the ground truth one, is
computed by $\frac{1}{N} \sum_{i=1}^{N} D_{\mathrm{KL2}}(M_i, \hat{M}_i)$. Using the $l_2$ norm, we can
compute the average Euclidean distance (AED) between the mean vectors of
two Gaussian distributions by $\frac{1}{N} \sum_{i=1}^{N} \|a_i - \hat{\mu}_i\|_2$. The $R^2$ statistic is
a standard way to measure the fitness of regression models [54]. It is used
to evaluate the prediction accuracy as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (e_i - \hat{e}_i)^2}{\sum_{i=1}^{N} (e_i - \bar{e})^2}, \tag{31}$$

where $\hat{e}_i$ and $e_i$ denote the predicted (either valence or arousal) value
and the ground truth one of a clip, respectively, and $\bar{e}$ is the average
ground truth value over the test set. When the predictive model perfectly
fits the ground truth values, $R^2$ is equal to 1. If the predictive model
does not fit the ground truth well, $R^2$ may become negative.
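For concreteness, the three metrics can be written as below. This is a
minimal sketch that uses the standard closed form of the KL divergence
between two Gaussians; the function names (`kl_gauss`, `kl2_gauss`,
`average_euclidean_distance`, `r_squared`) are illustrative assumptions and
not part of any released implementation.

```python
import numpy as np

def kl_gauss(mu_a, cov_a, mu_b, cov_b):
    """One-way KL divergence D_KL(G_A || G_B) between two Gaussians (Eq. 25)."""
    mu_a, mu_b = np.asarray(mu_a, float), np.asarray(mu_b, float)
    cov_a, cov_b = np.asarray(cov_a, float), np.asarray(cov_b, float)
    dim = mu_a.shape[0]                      # 2 in the VA space
    inv_b = np.linalg.inv(cov_b)
    prod = cov_a @ inv_b
    diff = mu_a - mu_b
    _, logdet = np.linalg.slogdet(prod)      # log |Sigma_A Sigma_B^{-1}|
    return 0.5 * (np.trace(prod) - logdet + diff @ inv_b @ diff - dim)

def kl2_gauss(mu_a, cov_a, mu_b, cov_b):
    """Two-way (symmetric) KL divergence of Eq. 30."""
    return 0.5 * (kl_gauss(mu_a, cov_a, mu_b, cov_b)
                  + kl_gauss(mu_b, cov_b, mu_a, cov_a))

def average_euclidean_distance(true_means, pred_means):
    """AED: mean l2 distance between ground truth and predicted VA means."""
    diff = np.asarray(true_means, float) - np.asarray(pred_means, float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

def r_squared(truth, pred):
    """Coefficient of determination of Eq. 31 for one emotion dimension."""
    truth, pred = np.asarray(truth, float), np.asarray(pred, float)
    ss_res = np.sum((truth - pred) ** 2)
    ss_tot = np.sum((truth - truth.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

AKL is then the mean of `kl2_gauss` over the test clips. Predicting the
ground truth mean for every clip gives an `r_squared` of exactly 0, which is
consistent with the near-zero values of the base-rate method.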

We perform three-fold cross-validation to evaluate the performance of
general MER. Specifically, the AMG1608 dataset is randomly partitioned
into three folds, and an MER model is trained on two of them and tested on
the other one. Each round of validation generates the predicted result of
one-third of the complete dataset. After three rounds, we will have the
predicted result of each clip in the complete dataset. Then, AKL, AED, and
the $R^2$ for valence and arousal are computed over the complete dataset,
instead of computing the performance over each one-third of the dataset.
This strategy gives an unbiased estimate for $R^2$.


5.3.4 Result 


We compare the performance of AEG with two baseline methods. The first
one, referred to as the base-rate method, uses a reference affective Gaussian
whose mean and covariance are set using the global mean and covariance of
the training annotations without taking into account the acoustic features.
In other words, the prediction for every test clip would be the same for
the base-rate method. The performance of this base-rate method can
accordingly be considered as a lower bound in this task. Moreover, we
compare the performance of AEG with SVR [51], a representative
regression-based approach for predicting emotion values or distributions,
using the same type of acoustic features. Specifically, the feature vector
of a clip is formed by concatenating the mean and standard deviation of all
the frame-level feature vectors within a clip, yielding a 144-dimensional
vector. We use the radial basis function (RBF) kernel SVR implemented by
the libSVM library, with parameters optimized by grid search with
three-fold cross-validation on the training set. We further use a heuristic
favorable for SVR to regularize every invalid predicted covariance
parameter [65]. This heuristic significantly improves the AKL performance
of SVR.

Table 2: Performance evaluation on general MER (↓ stands for smaller-better
and ↑ for larger-better).

| Method        | AKL ↓  | AED ↓  | $R^2$ Valence ↑ | $R^2$ Arousal ↑ |
|---------------|--------|--------|-----------------|-----------------|
| Base-rate     | 1.2228 | 0.4052 | -0.0009         | 0.0000          |
| SVR-RBF       | 0.7124 | 0.2895 | 0.1409          | 0.6613          |
| AEG (K = 128) | 0.7049 | 0.2890 | 0.1601          | 0.6554          |
| AEG (K = 256) | 0.7078 | 0.2869 | 0.1579          | 0.6686          |

Our pilot study empirically shows that AEG Uniform gives better emotion
prediction in AED, compared to AEG AnnoPrior, possibly because the
introduction of the annotation prior may bias the estimation of the mean
parameters in the EM learning. In contrast, AEG AnnoPrior leads to a better
result in AKL, indicating its capability of estimating a more proper
covariance for a learned affective GMM. In light of this, we use the
following hybrid method to take advantage of both AEG AnnoPrior and AEG
Uniform in optimizing the affective GMM. Suppose that we have learned two
affective GMMs, one for AEG AnnoPrior and the other for AEG Uniform.
To generate a combined affective GMM, for its k-th component Gaussian, we
take the mean from the k-th Gaussian of AEG Uniform and the covariance
from the k-th Gaussian of AEG AnnoPrior. This combined affective GMM is
eventually used to predict the emotion for a test clip in this evaluation.

Table 2 compares the performance of AEG with the two baseline methods.
It can be seen that both SVR and AEG outperform the base-rate method
by a great margin, and that AEG can outperform SVR. For AEG, we
obtain better AKL and better $R^2$ for valence when K = 128, but better AED
and better $R^2$ for arousal when K = 256. The best $R^2$ values achieved for
valence and arousal are 0.1601 and 0.6686, respectively. In particular, the
superior performance of AEG in $R^2$ for valence is remarkable. This
observation suggests that AEG is a promising approach, as it is typically
more difficult to model the valence perception from audio signals [70].

Figure 3 presents the result of AEG when we vary the value of K (i.e.
the number of latent topics). It can be seen that the performance of AEG
improves as a function of K when K is smaller than 256, but starts to
decrease when K is sufficiently large. The best result is obtained when K is
set to 128 or 256. As the parameters of SVR-RBF have also been optimized,
this result shows that, even if the optimal case of AEG is not attained
(e.g., K = 64 or 512), AEG is still on par with the state-of-the-art SVR
approach to general MER.

Figure 3: Performance evaluation on general MER, using different numbers
of latent topics in AEG. Panels: (a) AKL, smaller-better; (b) AED,
smaller-better; (c) $R^2$ of valence, larger-better; (d) $R^2$ of arousal,
larger-better.

5.4 Evaluation on Personalized MER 

5.4.1 Evaluation Setup 

The trade-off between the number of personal annotations (feedback) and
the performance of personalization is important for personalized MER. On
one hand, we hope to have more personal annotations to more accurately
model the emotion perception of a particular user. On the other hand, we
want to restrict the number of personal annotations so as to relieve the
burden on the user. To reflect this, the performance of personalized MER
is evaluated by fixing the test set for each user, but varying the number
of available emotion annotations from the particular user to test how the
performance improves as personal data accumulate.

We consider 41 users who have annotated more than 150 clips in this
evaluation. We use the data of 6 of them for parameter tuning and the data
of the remaining 35 for the evaluation, and report the average result over
these 35 test users. One hundred annotations of each test user are randomly
selected as the personalized training set for the user. Another 50 clips
annotated by the same user are randomly selected as the test set.
Specifically, for each test user, a general MER model is trained with
600 clips randomly selected from the original AMG1608, excluding those
annotated by the test user under consideration and those with
self-inconsistent annotations. Then, the general model is incrementally
personalized five times using different numbers of clips selected from the
personalized training set. We use 10, 20, 30, 40, and 50 clips iteratively,
with the preceding clips being a subset of the current ones each time. The
process is repeated 10 times for each user.

We use the following evaluation metrics here: the AED, the $R^2$, and
the average likelihood (ALH) of generating the ground-truth annotation (a
single VA point) $e_i$ of the test user using the predicted affective
Gaussian, i.e. $p(e_i \mid \hat{\mu}_i, \hat{\Sigma}_i)$. Larger ALH corresponds to better
accuracy. We do not report KL divergence here because each clip in the
dataset is annotated by a user at most once, which does not constitute a
probability distribution.
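The ALH can be computed by evaluating each predicted Gaussian at the
user's single VA annotation and averaging over the test clips. The sketch
below assumes one predicted mean and covariance per test clip; the names
`gaussian_pdf` and `average_likelihood` are chosen for this example only.

```python
import numpy as np

def gaussian_pdf(e, mu, cov):
    """Density of a multivariate Gaussian G(mu, cov) evaluated at point e."""
    e, mu, cov = np.asarray(e, float), np.asarray(mu, float), np.asarray(cov, float)
    dim = mu.shape[0]
    diff = e - mu
    quad = diff @ np.linalg.solve(cov, diff)          # Mahalanobis term
    norm = np.sqrt((2.0 * np.pi) ** dim * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)

def average_likelihood(annotations, pred_means, pred_covs):
    """ALH: mean of p(e_i | mu_i, Sigma_i) over the test clips of one user."""
    values = [gaussian_pdf(e, m, c)
              for e, m, c in zip(annotations, pred_means, pred_covs)]
    return float(np.mean(values))
```

Unlike AED, this metric rewards a prediction whose covariance is
well calibrated: an overly narrow Gaussian is penalized heavily whenever the
annotation falls away from its mean.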


5.4.2 Result 


We compare the MAP-based personalization method of AEG with the two-stage
personalization method of SVR proposed in [78]. In the two-stage SVR
method, the first stage creates a general SVR model for general emotion
prediction, whereas the second stage creates a personalized SVR that is
trained solely on a user’s annotations. The final prediction is obtained by
linearly combining the predictions from the general SVR and the
personalized SVR with weights 0.7 and 0.3, respectively. The weights are
derived empirically according to our pilot study. As for AEG, we only
update the mean parameters, with the MAP adaptation parameter set to 0.01,
because our pilot study shows that updating the covariance does not lead to
better performance. This observation is also in line with the findings in
speaker adaptation [44]. We train the background model with AEG Uniform for
simplicity.

Figure 4: Performance evaluation on personalized MER, with varying numbers
of personal data. Panels: (a) ALH, larger-better; (b) AED, smaller-better;
(c) $R^2$ of valence, larger-better; (d) $R^2$ of arousal, larger-better.


Figure 4 compares the result of different personalized MER methods
when we vary the number of available personal annotations. The starting
point of each curve is the result given by the general MER model trained
on a subset of the users of the AMG1608 dataset. We can see that the result
of the general model is inferior to those reported in Figure 3, showing
that a general MER model is less effective when it is used to predict the
emotion perception of individual users, compared to the case of predicting
the average emotion perception of users. We can also see that the result of
the considered personalized methods generally improves as the number of
personal annotations increases. When the value of K is sufficiently large,
AEG-based personalization methods can outperform the SVR method. Moreover,
while the result of SVR starts to saturate when the number of personal
annotations is larger than 20, AEG has the potential to keep improving the
performance by exploiting more personal annotations. We also note that
there is no significant performance difference for AEG when K is large
enough (e.g. K > 128).

Figure 5: The stress-sensitive user interface for emotion-based music
retrieval. Users can (a) specify a point or (b) draw a trajectory, while
specifying the variance with different levels of duration.

Although our evaluation shows that personalization methods can improve
the result of personalized emotion prediction, the low values of the $R^2$
statistics for valence and arousal still show that the problem is fairly
challenging. Future work is still needed to improve the quality of the
emotion annotation data, the feature extraction, or the machine learning
algorithms for modeling emotion perception.


6 Emotion-based Music Retrieval 

6.1 The VA-oriented Query Interface 

The VA space offers a ready canvas for music retrieval through the
specification of a point in the emotion space. Users can retrieve music
pieces of certain emotions without specifying the titles. Users can also
draw a trajectory to indicate the desired emotion changes across a list of
songs (e.g. from angry to tender).

In addition to the above point-based query, one can also issue a
Gaussian-based query to an AEG-based retrieval system. As Figure 5 shows,
users can specify the desired variances (or the confidence level at the
center point) of emotion by pressing a point in the VA space with different
levels of duration or strength. The variance of the Gaussian gets smaller
as one increases the duration or strength of pressing, as Figure 5(a)
shows. Larger variances indicate less specific emotion around the center
point. After specifying the size of a circular variance shape, one can even
pinch fingers to adjust the variance shape. For a trajectory-based query
input, similarly, the corresponding variances are determined according to
the dynamic speed when drawing the trajectory, as Figure 5(b) shows. Fast
speed corresponds to a less specific query, and the system will return
pieces whose variances of emotion are larger. If songs with more specific
emotions are desirable, one can slow down the speed when drawing the
trajectory. The queries entered through such a stress-sensitive interface
can be handled by AEG for emotion-based music retrieval.

Table 3: The two approaches of the emotion-based music retrieval system.

| Approach           | Indexing phase                   | Indexed type                                                                    | Matching phase                                                  |
|--------------------|----------------------------------|---------------------------------------------------------------------------------|-----------------------------------------------------------------|
| Emotion Prediction | full procedure of MER by AEG     | an affective GMM (Eq. 23) or a 2-dim Gaussian $\{\hat{\mu}, \hat{\Sigma}\}$     | likelihood (for point query) or distance (for Gaussian query)   |
| Folding-In         | compute only the topic posterior | K-dim vector $\theta$                                                           | cosine similarity of pseudo song (K-dim vector $\lambda$)       |

6.2 Overview of the Emotion-based Music Retrieval System

As Figure 6 shows, the content-based retrieval system can be divided into
two phases. In the feature indexing phase, we index each music clip in an
unlabeled music database by one of the following two approaches: the
emotion prediction approach indexes a clip with the predicted emotion
distribution (an affective GMM or a single 2-D Gaussian) given by MER,
whereas the folding-in approach indexes a clip with the topic posterior (a
K-dimensional vector). In the later music retrieval phase, given an
arbitrary emotion-oriented query, the system returns a list of music clips
ranked according to one of the following two approaches:
likelihood/distance-based matching and pseudo song-based matching. These
two ranking approaches correspond to the two indexing approaches,
respectively, as summarized in Table 3. We present the details of the two
approaches in the following subsections.

Figure 6: The diagram of the content-based music retrieval system using an
emotion query.


6.3 The Emotion Prediction-based Approach 

This approach indexes each clip as a single, representative Gaussian
distribution or an affective GMM in the offline MER procedure. The query is
then compared with the predicted emotion distribution of each clip in the
database. The system ranks all the clips based on the likelihoods or
distances in response to the query. Clips with larger likelihood or smaller
distance are ranked higher.

Given a point query $e$, the corresponding likelihood of the indexed
emotion distribution of a clip $\theta_i$ is generated by a single Gaussian
$p(e \mid \hat{\mu}_i, \hat{\Sigma}_i)$ or an affective GMM $p(e \mid \theta_i)$ (cf. Eq. 23), where
$\{\hat{\mu}_i, \hat{\Sigma}_i\}$ are the predicted parameters of the representative Gaussian
for $\theta_i$, and $\theta_{i,k}$ is the k-th component of $\theta_i$. Note that here we use
the topic posterior vector to represent a clip in the database.

When it comes to a Gaussian-based query $G$, specified by its own mean
vector and covariance matrix, the approach generates the ranking scores
based on the KL2 divergence. In the case of indexing with a single
Gaussian, we use Eq. 30 to compute $D_{\mathrm{KL2}}(G, G(\hat{\mu}_i, \hat{\Sigma}_i))$ between the
query and a clip. On the other hand, in the case of indexing with an
affective GMM, we compute the weighted KL2 divergence by

$$D_{\mathrm{KL2}}\big(G, p(e \mid \theta_i)\big) = \sum_{k=1}^{K} \theta_{i,k} \, D_{\mathrm{KL2}}(G, G_k). \tag{32}$$

Figure 7: Illustration of the Folding-In process of emotion-based music
retrieval by AEG.
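As an illustration of the Emotion Prediction approach with a Gaussian
query, the weighted KL2 divergence of Eq. 32 can be used to rank clips as
sketched below. The helper names and the toy two-topic model are
assumptions made for this example; the KL divergence uses the standard
closed form for Gaussians.

```python
import numpy as np

def kl_gauss(mu_a, cov_a, mu_b, cov_b):
    """One-way KL divergence D_KL(G_A || G_B) between two Gaussians."""
    mu_a, mu_b = np.asarray(mu_a, float), np.asarray(mu_b, float)
    cov_a, cov_b = np.asarray(cov_a, float), np.asarray(cov_b, float)
    inv_b = np.linalg.inv(cov_b)
    prod = cov_a @ inv_b
    diff = mu_a - mu_b
    _, logdet = np.linalg.slogdet(prod)
    return 0.5 * (np.trace(prod) - logdet + diff @ inv_b @ diff - mu_a.shape[0])

def kl2_gauss(mu_a, cov_a, mu_b, cov_b):
    """Two-way KL divergence (Eq. 30)."""
    return 0.5 * (kl_gauss(mu_a, cov_a, mu_b, cov_b)
                  + kl_gauss(mu_b, cov_b, mu_a, cov_a))

def weighted_kl2(query_mu, query_cov, theta_i, means, covs):
    """Eq. 32: topic-posterior-weighted KL2 between a query and one clip."""
    return sum(t * kl2_gauss(query_mu, query_cov, m, c)
               for t, m, c in zip(theta_i, means, covs))

def rank_by_weighted_kl2(query_mu, query_cov, thetas, means, covs):
    """Return clip indices sorted by ascending divergence, plus the scores."""
    scores = np.array([weighted_kl2(query_mu, query_cov, th, means, covs)
                       for th in thetas])
    return np.argsort(scores, kind="stable"), scores

# Hypothetical affective GMM with K = 2 and a database of two clips.
means = np.array([[-0.5, -0.5], [0.5, 0.5]])
covs = np.array([0.05 * np.eye(2), 0.05 * np.eye(2)])
thetas = np.array([[0.9, 0.1], [0.1, 0.9]])
order, scores = rank_by_weighted_kl2([0.5, 0.5], 0.05 * np.eye(2),
                                     thetas, means, covs)
print(order, scores)
```

Here the second clip, whose topic posterior is dominated by the component
closest to the query, is ranked first.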


6.4 The Folding-In-based Approach 

As Figure 7 shows, this approach estimates the probability distribution
$\lambda = \{\lambda_k\}_{k=1}^{K}$, subject to $\sum_{k} \lambda_k = 1$, for an input VA-oriented query
in an online manner. Each estimated $\lambda_k$ corresponds to the relevance of a
query to the k-th latent topic $z_k$, so we can treat the distribution $\lambda$
as the topic posterior of the query and call it a pseudo song. In the case
of Figure 7, for example, we show a query that is very likely to be
represented by the second affective Gaussian component. The folding-in
process is likely to assign a dominant weight $\lambda_2 = 1$ for $z_2$, and
$\lambda_h = 0$, $\forall h \neq 2$. This implies that the query is highly related to the
song whose topic posterior is dominated by $\theta_2$. Therefore, the pseudo song
can be used to match with the topic posterior vector $\theta_i$ of each clip in
the database.

Given a point query $e$, we start the folding-in process by first
generating the pseudo song via maximizing the query likelihood of the
$\lambda$-weighted affective GMM with respect to $\lambda$. By taking the logarithm of
Eq. 23, we obtain the following objective function:

$$\max_{\lambda} \; \log \sum_{k=1}^{K} \lambda_k \, G_k(e \mid \mu_k, \Sigma_k), \tag{33}$$

where $\lambda_k$ is the k-th component of the vector $\lambda$. In some sense, a good
$\lambda$ will make the corresponding $\lambda$-weighted affective GMM generate the
query $e$ well. The problem in Eq. 33 can be solved by the EM algorithm. In
the E-step, the posterior probability of $z_k$ is computed by

$$p(z_k \mid e) = \frac{\lambda_k \, G_k(e \mid \mu_k, \Sigma_k)}{\sum_{h=1}^{K} \lambda_h \, G_h(e \mid \mu_h, \Sigma_h)}. \tag{34}$$

In the M-step, we then only update $\lambda_k$ by

$$\lambda_k \leftarrow p(z_k \mid e). \tag{35}$$

As for a Gaussian-based query $G$, we fold the query into the learned
affective GMM to estimate a pseudo song as well. This time, we maximize
the following log-likelihood function,

$$\max_{\lambda} \; \log \sum_{k=1}^{K} \lambda_k \, p(G \mid G_k), \tag{36}$$

where $p(G \mid G_k)$ is the likelihood function based on KL2 (cf. Eq. 30):

$$p(G \mid G_k) = \exp\big(-D_{\mathrm{KL2}}(G, G_k)\big). \tag{37}$$

Again, Eq. 36 can be solved by the EM algorithm, with the following update:

$$\lambda_k \leftarrow p(z_k \mid G) = \frac{\lambda_k \, p(G \mid G_k)}{\sum_{h=1}^{K} \lambda_h \, p(G \mid G_h)}. \tag{38}$$


The EM processes for both point- and Gaussian-based queries stop early
after a few iterations (e.g. 3), because the pseudo song estimation is
sensitive to over-fitting. Several initialization settings can be used,
such as a random, uniform, or prior distribution. Considering the stability
and the reproducibility of the experimental result, we opt for using a
uniform distribution for initialization. Note that random initialization
may introduce discrepant results among different trials even with identical
experimental settings, whereas initializing with a prior distribution may
render biased results in favor of songs that predominate in the training
data [63]. Finally, the retrieval system ranks all the clips in descending
order of the following cosine similarities in response to the pseudo song:

$$\Phi(\lambda, \theta_i) = \frac{\lambda^{T} \theta_i}{\|\lambda\| \, \|\theta_i\|}. \tag{39}$$
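The folding-in procedure can be sketched as follows. The sketch takes the
per-component scores as input, i.e. $G_k(e \mid \mu_k, \Sigma_k)$ for a point query
or $p(G \mid G_k)$ from Eq. 37 for a Gaussian query, runs the EM update with a
uniform initialization and early stopping, and ranks clips by the cosine
similarity of Eq. 39. The function names and the toy data are assumptions
of this example.

```python
import numpy as np

def gaussian_pdf(e, mu, cov):
    """Density of a Gaussian G(mu, cov) evaluated at point e."""
    e, mu, cov = np.asarray(e, float), np.asarray(mu, float), np.asarray(cov, float)
    diff = e - mu
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** mu.shape[0] * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)

def fold_in(component_scores, n_iter=3):
    """Estimate the pseudo song lambda with a few EM iterations.

    component_scores[k] is the likelihood of the query under topic k.
    """
    p = np.asarray(component_scores, dtype=float)
    lam = np.full(p.shape[0], 1.0 / p.shape[0])   # uniform initialization
    for _ in range(n_iter):                        # early stopping
        post = lam * p                             # E-step numerator (Eqs. 34, 38)
        total = post.sum()
        if total <= 0.0:                           # all scores underflowed
            break
        lam = post / total                         # M-step (Eq. 35)
    return lam

def rank_by_cosine(lam, thetas):
    """Rank clips by descending cosine similarity to the pseudo song (Eq. 39)."""
    thetas = np.asarray(thetas, dtype=float)
    sims = thetas @ lam / (np.linalg.norm(thetas, axis=1) * np.linalg.norm(lam))
    return np.argsort(-sims, kind="stable"), sims

# Hypothetical affective GMM with K = 2, a point query, and two indexed clips.
means = np.array([[-0.5, -0.5], [0.5, 0.5]])
covs = np.array([0.05 * np.eye(2), 0.05 * np.eye(2)])
query = np.array([0.4, 0.5])
scores = [gaussian_pdf(query, m, c) for m, c in zip(means, covs)]
pseudo_song = fold_in(scores, n_iter=3)
thetas = np.array([[0.9, 0.1], [0.1, 0.9]])
order, sims = rank_by_cosine(pseudo_song, thetas)
print(pseudo_song, order)
```

With a uniform start, each iteration multiplies the weights by the
component scores and renormalizes, so the pseudo song sharpens toward the
best-matching topic as the number of iterations grows. This makes the role
of early stopping explicit.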


6.5 Discussion 

The Emotion Prediction approach is straightforward, as the purpose of MER 
is to automatically index unseen music pieces in the database. In contrast, 
the Folding-In approach goes one step further to embed a VA-based query 
into the space of music clips. Although the folding-in process requires an 
additional step of estimating the pseudo song, it is in fact more flexible. In
a personalized music retrieval context, for example, a personalized affective
GMM can readily produce a personalized pseudo song for comparing with 
the original topic posterior vectors of all the pieces in the database, without 
the need to predict the emotion again with the personalized model. 

The complexity of the Emotion Prediction approach mainly comes from
computing the likelihood of a point query on each music clip’s emotion
distribution or the KL divergence between the Gaussian query and the
emotion distribution of each clip. Therefore, the matching process needs to
compute the likelihood or the KL divergence N times, where N is the number
of clips in the database. In the Folding-In approach, the complexity comes
from estimating the pseudo song (with the EM algorithm) and computing the
cosine similarity between the pseudo song and each clip. EM needs to
compute K × ITER times the likelihood of a component affective Gaussian or
the Gaussian KL divergence, where ITER is the number of EM iterations.
Then, the matching process computes the cosine similarity N times.
Computing the likelihood on an emotion distribution (i.e. a single Gaussian
or a GMM) is computationally more expensive than computing the cosine
similarity (as K is usually not large). Therefore, when N is large (e.g.
$N \gg K \times \mathrm{ITER}$), the Folding-In approach is the more feasible one in
practice.

6.6 Evaluation for Emotion-based Music Retrieval 

6.6.1 Evaluation Setup 

The AMG1608 dataset is again adopted in this music retrieval evaluation.
We consider two emotion-based music retrieval scenarios: query-by-point and
query-by-Gaussian. For each scenario, we create a set of synthetic queries
and use the learned AEG model to respond to each test query and return a
ranked list of music clips from an unlabeled music database. The generation
of the test query set for query-by-point is simple. As Figure 8(a) shows,
we uniformly sample 100 2-D query points within $[[-1, -1]^{T}, [1, 1]^{T}]$ in
the VA space. The test query set for query-by-Gaussian is then based on
this set of points. Specifically, we convert a point query to a Gaussian
query by associating with the point a 2-by-2 covariance matrix, as Figure
8(b) shows. Motivated by our empirical observation from data, the
covariance of a Gaussian query is set in inverse proportion to the distance
between the mean of the Gaussian query (determined by the corresponding
point query) and the origin of the VA space. That is, if a given point
query is far from the origin (with large emotion magnitude), the user may
want to retrieve songs with a specific emotion (with a smaller covariance
ellipse).

Figure 8: Test queries used in evaluating emotion-based music retrieval: (a)
100 points generated uniformly in between [-1, 1], (b) 100 Gaussians
generated based on the previous 100 points.

The performance is evaluated by aggregating the ground truth relevance
scores of the retrieved music clips according to the normalized discounted
cumulative gain (NDCG), a widely used performance measure for ranking
problems [25]. The NDCG@P, which measures the relevance of the top P
retrieved clips for a query, is computed by

$$\mathrm{NDCG@}P = \frac{1}{Z_P} \left\{ R(1) + \sum_{i=2}^{P} \frac{R(i)}{\log_2 i} \right\}, \tag{40}$$

where $R(i)$ is the ground truth relevance score of the rank-i clip,
$i = 1, \ldots, Q$, where $Q > P$ is the number of clips in the music database, and
$Z_P$ is the normalization term that ensures the ideal NDCG@P equals 1. Let
$M_i$ (with parameters $\{a_i, B_i\}$) denote the ground-truth annotation Gaussian
of the rank-i clip. For a point query $e$, $R(i)$ is obtained by
$p(e \mid a_i, B_i)$, the likelihood of the query point. For a Gaussian query
$G$, $R(i)$ is given by $p(G \mid M_i)$ defined by Eq. 37. From Eq. 40 we see
that if the system ranks the clips in an order similar to the descending
order of the ground truth relevance scores $R(i)$, we obtain a larger NDCG.
We report the average NDCG computed over the test query set. Note that we
do not adopt evaluation metrics such as the mean average precision (MAP)
and the area under the ROC curve (AUC), because currently it is not trivial
to set a threshold to binarize $R(i)$.

We perform three-fold cross-validation as in the evaluation of general
MER. In each round, the test fold (with 536 clips) serves as the unlabeled
music database.

Figure 9: Evaluation result of emotion-based music retrieval. Panels: (a)
point-based query, larger-better; (b) Gaussian-based query, larger-better.
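A minimal sketch of the NDCG@P computation in Eq. 40 is given below. It
assumes that the normalization term $Z_P$ is the discounted cumulative gain
of the ideal ranking, obtained by sorting the relevance scores of all
clips in the database in descending order; the function name `ndcg_at_p` is
chosen for this example.

```python
import numpy as np

def ndcg_at_p(relevance_in_ranked_order, p):
    """NDCG@P following Eq. 40.

    relevance_in_ranked_order[i] is the ground truth relevance R(i+1) of the
    clip placed at rank i+1 by the system, for all clips in the database.
    """
    rel = np.asarray(relevance_in_ranked_order, dtype=float)

    def dcg(scores):
        top = scores[:p]
        discounts = np.ones(len(top))              # rank 1 is not discounted
        discounts[1:] = np.log2(np.arange(2, len(top) + 1))
        return float(np.sum(top / discounts))

    ideal = dcg(np.sort(rel)[::-1])                # Z_P: DCG of the ideal ranking
    return dcg(rel) / ideal if ideal > 0 else 0.0

# Hypothetical relevance scores of three clips in the order returned.
print(ndcg_at_p([3.0, 2.0, 1.0], p=3))  # 1.0, the ranking is already ideal
print(ndcg_at_p([1.0, 2.0, 3.0], p=3))  # below 1, the best clip is ranked last
```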


6.6.2 Result 


We implement a Random approach to reflect the lower bound performance by
using a random permutation for each test query, without taking into
consideration any ranking approach. We further implement an Ensemble
approach that averages the rankings of a test query given by Emotion
Prediction and Folding-In. Specifically, both approaches assign an ordinal
number to a clip according to their respective rankings. Then, we average
the two ordinal numbers of a clip as a new score, and re-rank all the clips
in ascending order of their new scores.
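The rank averaging used by the Ensemble approach can be sketched as below.
How ties between averaged ranks are broken is not stated in the text; this
sketch breaks them by clip index through a stable sort, which is an
assumption, as are the function names.

```python
import numpy as np

def ranks_from_order(order):
    """Convert a ranked list of clip indices into 1-based ranks per clip."""
    order = np.asarray(order)
    ranks = np.empty(len(order), dtype=float)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

def ensemble_order(order_a, order_b):
    """Average the ordinal ranks of two rankings and re-rank in ascending order."""
    avg = 0.5 * (ranks_from_order(order_a) + ranks_from_order(order_b))
    return np.argsort(avg, kind="stable")   # ties broken by clip index

# Hypothetical rankings of three clips by the two approaches.
emotion_prediction_order = [2, 0, 1]
folding_in_order = [2, 1, 0]
print(ensemble_order(emotion_prediction_order, folding_in_order))  # [2 0 1]
```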

Note that we only consider AEG Uniform for simplicity in the result
presentation. Our preliminary study reveals that AEG Uniform in general
performs slightly better than AEG AnnoPrior and the hybrid method mentioned
in Section 5.3.4 in the retrieval task. Moreover, for the Folding-In
approach, early stopping is not only important to the folding-in process,
but also necessary for learning the affective GMM. According to our pilot
study, setting ITER to a value between 2 and 4 for learning the affective
GMM and ITER = 3 for learning the pseudo song leads to the optimal
performance.

Figures 9(a) and 9(b) compare the NDCG@5 of the Emotion Prediction and
Folding-In approaches to emotion-based music retrieval using either
point-based or Gaussian-based queries. We are interested in how the result
changes as we vary the number of latent topics. It can be seen that the two
approaches perform very similarly for point-based queries when K is between
64 and 256. Moreover, we see that Emotion Prediction can outperform
Folding-In for Gaussian-based queries when K is sufficiently large
(K > 64). The optimal model is attained when K = 128 in all cases. Similar
to the result in general MER, it seems that setting K either too large or
too small would lead to a sub-optimal result.

Table 4: The query-by-point retrieval performance in terms of NDCG@5, 10,
20, and 30.

| Method             | P = 5  | P = 10 | P = 20 | P = 30 |
|--------------------|--------|--------|--------|--------|
| Random             | 0.1398 | 0.1504 | 0.1666 | 0.1804 |
| Emotion Prediction | 0.3907 | 0.4027 | 0.4288 | 0.4490 |
| Folding-In         | 0.3868 | 0.4067 | 0.4333 | 0.4533 |
| Ensemble           | 0.3954 | 0.4129 | 0.4398 | 0.4601 |

Table 5: The query-by-Gaussian retrieval performance in terms of NDCG@5,
10, 20, and 30.

| Method             | P = 5  | P = 10 | P = 20 | P = 30 |
|--------------------|--------|--------|--------|--------|
| Random             | 0.1032 | 0.1090 | 0.1185 | 0.1272 |
| Emotion Prediction | 0.3143 | 0.3306 | 0.3481 | 0.3658 |
| Folding-In         | 0.2932 | 0.3147 | 0.3383 | 0.3532 |
| Ensemble           | 0.3204 | 0.3368 | 0.3601 | 0.3783 |

Tables 4 and 5 present the result of NDCG@5, 10, 20, and 30 for different
retrieval methods, including the random baseline, Emotion Prediction,
Folding-In, and the Ensemble approaches. The latter three use AEG Uniform
with K = 128. The latter three clearly outperform the random baseline by a
large margin, demonstrating the effectiveness of AEG in emotion-based music
retrieval. It can also be seen that the Ensemble approach leads to the best
result.

A closer comparison between Emotion Prediction and Folding-In for
point-based queries shows that the two are neck and neck, whereas the
former performs consistently better regardless of the value of P for
Gaussian-based queries. Moreover, the NDCG measure seems more favorable for
point-based queries than for Gaussian-based ones. Our observation indicates
that the standard deviation of the ground truth relevance scores (i.e.
$R(i)$) for Gaussian-based queries is much larger, resulting in a more
challenging measurement basis than that for point-based queries. However,
the relative performance difference between the two methods is similar for
point-based and Gaussian-based queries.


7 Conclusion 

AEG is a principled probabilistic framework that unifies the computation
processes for MER and emotion-based music retrieval for dimensional
emotion representations such as valence and arousal. Moreover, AEG better
takes into account the subjective nature of emotional responses to music
through the use of probabilistic inference and model adaptation, which
further makes it possible to personalize an emotion-based MIR system. The
code for implementing AEG can be retrieved from the link below:
http://slam.iis.sinica.edu.tw/demo/AEG/.

Although AEG is a powerful approach, a number of challenges remain for
MER, including:


• Is it best to treat the valence-arousal space as a coordinate space
(with two orthogonal axes)?

• How do we define the “intensity” of emotion? Does the magnitude of
a point in the emotion space imply intensity? Would it be possible to
train regressors that treat the emotion space as a polar coordinate system?

• Which features are more important for modeling emotion?

• Cross-genre generalizability [13].

• Cross-culture generalizability [21].

• How to incorporate lyrics features for MER?

• How to model the effect of the singing voice in emotion perception?

• How do findings in MER help emotion-based music synthesis or
manipulation?
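
The polar-coordinate question in the list above can be illustrated with a small sketch. It assumes VA annotations that are centered at the origin (for example, scaled to [-1, 1]) and takes the radius as a candidate "intensity"; this is only the hypothesis raised in the question, not an established result, and the function names are hypothetical.

```python
import math


def va_to_polar(valence, arousal):
    """Map a valence-arousal point to (intensity, angle).

    Assumption: the VA space is centered at the origin, so the distance
    from the origin can be read as a candidate intensity. The angle is in
    radians, measured counter-clockwise from the positive valence axis.
    """
    intensity = math.hypot(valence, arousal)
    angle = math.atan2(arousal, valence)
    return intensity, angle


def polar_to_va(intensity, angle):
    """Inverse mapping back to (valence, arousal)."""
    return intensity * math.cos(angle), intensity * math.sin(angle)


if __name__ == "__main__":
    # Hypothetical annotation: mildly positive valence, high arousal.
    r, theta = va_to_polar(0.3, 0.8)
    print(r, math.degrees(theta))
    print(polar_to_va(r, theta))
```

A regressor trained on such targets would predict the radius and the angle instead of valence and arousal. The angle is periodic, so a practical model would need to handle the wrap-around at ±π, for instance by predicting the sine and cosine of the angle.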


We note that AEG is suitable for an emotion-based MIR system only
when emotions are characterized in terms of valence and arousal. It does
not apply to systems that use categorical mood tags to describe emotion. A
corresponding probabilistic model for categorical MER is yet to be developed.
More research efforts are also needed on the personalization and retrieval
aspects of categorical MER.

The AEG model itself can also be improved in a number of directions. For
example, several alternative methods can be adopted to enhance the latent
acoustic descriptors used for the clip-level topic posterior
representation, such as deep learning [50] or sparse representations [57].
One can also perform discriminative training on the same corpus to reduce
the prediction error, with respect to either the selection of Gaussian
components or parameter refinement over the affective GMM. For example,
stacked discriminative learning on parameters initialized by an EM-learned
generative model has been studied for years in speech recognition [9, 26].
Following this research line may help improve AEG as well. Finally, the
AEG framework can be easily extended to include multi-modal content such
as lyrics, review comments, album covers, and music videos. For instance,
given a silent video sequence, one can accompany it with a piece of music
based on music emotion 